ACE: A Security Architecture for LLM-Integrated App Systems

arXiv.org Artificial Intelligence

LLM-integrated app systems extend the utility of Large Language Models (LLMs) with third-party apps that are invoked by a system LLM using interleaved planning and execution phases to answer user queries. These systems introduce new attack vectors where malicious apps can cause integrity violation of planning or execution, availability breakdown, or privacy compromise during execution. In this work, we identify new attacks impacting the integrity of planning, as well as the integrity and availability of execution in LLM-integrated apps, and demonstrate them against IsolateGPT, a recent solution designed to mitigate attacks from malicious apps. We propose Abstract-Concrete-Execute (ACE), a new secure architecture for LLM-integrated app systems that provides security guarantees for system planning and execution. Specifically, ACE decouples planning into two phases by first creating an abstract execution plan using only trusted information, and then mapping the abstract plan to a concrete plan using installed system apps. We verify that the plans generated by our system satisfy user-specified secure information flow constraints via static analysis on the structured plan output. During execution, ACE enforces data and capability barriers between apps, and ensures that the execution is conducted according to the trusted abstract plan. We show experimentally that ACE is secure against attacks from the InjecAgent and Agent Security Bench benchmarks for indirect prompt injection, and our newly introduced attacks. We also evaluate the utility of ACE in realistic environments, using the Tool Usage suite from the LangChain benchmark. Our architecture represents a significant advancement towards hardening LLM-based systems using system security principles.
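The static check on a structured plan mentioned in the abstract above can be illustrated with a small sketch. It is not ACE's implementation: the plan format (a list of steps with an `app` name and the indices of earlier steps it reads from), the app names, and the `forbidden` set of (source, sink) pairs are all hypothetical, chosen only to show how a forbidden information flow could be detected before anything is executed.

```python
def find_flow_violations(plan, forbidden):
    """Return (source_app, sink_app) pairs from `forbidden` that the plan permits.

    plan: list of steps; each step is {"app": str, "inputs": [indices of earlier steps]}.
    forbidden: set of (source_app, sink_app) pairs that must never be connected.
    Assumes every input index refers to an earlier step in the list.
    """
    reach = []        # reach[i] = apps whose data can appear in step i's output
    violations = []
    for step in plan:
        incoming = set()
        for j in step["inputs"]:
            incoming |= reach[j]          # data flows transitively through steps
        for source in sorted(incoming):
            if (source, step["app"]) in forbidden:
                violations.append((source, step["app"]))
        reach.append(incoming | {step["app"]})
    return violations


# Hypothetical plan: read email, summarize it, post the summary to the web.
plan = [
    {"app": "email_reader", "inputs": []},
    {"app": "summarizer", "inputs": [0]},
    {"app": "web_post", "inputs": [1]},
]
forbidden = {("email_reader", "web_post")}
violations = find_flow_violations(plan, forbidden)
print(violations)
```

Here the flow is indirect (it passes through the summarizer), which is why the check propagates sources transitively instead of inspecting only direct inputs.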


Detecting Content Rating Violations in Android Applications: A Vision-Language Approach

arXiv.org Artificial Intelligence

Despite regulatory efforts to establish reliable content-rating guidelines for mobile apps, the process of assigning content ratings in the Google Play Store remains self-regulated by the app developers. There is no straightforward method of verifying developer-assigned content ratings manually due to the overwhelming scale or automatically due to the challenging problem of interpreting textual and visual data and correlating them with content ratings. We propose and evaluate a vision-language approach to predict the content ratings of mobile game applications and detect content rating violations, using a dataset of metadata of popular Android games. Our method achieves ~6% better relative accuracy compared to the state-of-the-art CLIP-fine-tuned model in a multi-modal setting. Applying our classifier in the wild, we detected more than 70 possible cases of content rating violations, including nine instances with the 'Teacher Approved' badge. Additionally, our findings indicate that 34.5% of the apps identified by our classifier as violating content ratings were removed from the Play Store. In contrast, the removal rate for correctly classified apps was only 27%. This discrepancy highlights the practical effectiveness of our classifier in identifying apps that are likely to be removed based on user complaints.


Getting Inspiration for Feature Elicitation: App Store- vs. LLM-based Approach

arXiv.org Artificial Intelligence

Over the past decade, app store (AppStore)-inspired requirements elicitation has proven to be highly beneficial. Developers often explore competitors' apps to gather inspiration for new features. With the advance of Generative AI, recent studies have demonstrated the potential of large language model (LLM)-inspired requirements elicitation. LLMs can assist in this process by providing inspiration for new feature ideas. While both approaches are gaining popularity in practice, there is a lack of insight into their differences. We report on a comparative study between AppStore- and LLM-based approaches for refining features into sub-features. By manually analyzing 1,200 sub-features recommended by both approaches, we identified their benefits, challenges, and key differences. While both approaches recommend highly relevant sub-features with clear descriptions, LLMs seem more powerful, particularly concerning novel, unseen app scopes. Moreover, some recommended features are imaginary with unclear feasibility, which suggests the importance of a human analyst in the elicitation loop.


Detecting and Characterising Mobile App Metamorphosis in Google Play Store

arXiv.org Artificial Intelligence

App markets have evolved into highly competitive and dynamic environments for developers. While the traditional app life cycle involves incremental updates for feature enhancements and issue resolution, some apps deviate from this norm by undergoing significant transformations in their use cases or market positioning. We define this previously unstudied phenomenon as 'app metamorphosis'. In this paper, we propose a novel and efficient multi-modal search methodology to identify apps undergoing metamorphosis and apply it to analyse two snapshots of the Google Play Store taken five years apart. Our methodology uncovers various metamorphosis scenarios, including re-births, re-branding, re-purposing, and others, enabling comprehensive characterisation. Although these transformations may register as successful for app developers based on our defined success score metric (e.g., re-branded apps performing approximately 11.3% better than an average top app), we shed light on the concealed security and privacy risks that lurk within, potentially impacting even tech-savvy end-users.


Multimodal Chain-of-Thought Reasoning via ChatGPT to Protect Children from Age-Inappropriate Apps

arXiv.org Artificial Intelligence

Mobile applications (Apps) could expose children to inappropriate themes such as sexual content, violence, and drug use. Maturity rating offers a quick and effective method for potential users, particularly guardians, to assess the maturity levels of apps. Determining accurate maturity ratings for mobile apps is essential to protect children's health in today's saturated digital marketplace. Existing approaches to maturity rating are either inaccurate (e.g., self-reported rating by developers) or costly (e.g., manual examination). In the literature, there are few text-mining-based approaches to maturity rating. However, each app typically involves multiple modalities, namely the app description as text and screenshots as images. In this paper, we present a framework for determining app maturity levels that utilizes multimodal large language models (MLLMs), specifically ChatGPT-4 Vision. Powered by Chain-of-Thought (CoT) reasoning, our framework systematically leverages ChatGPT-4 to process multimodal app data (i.e., textual descriptions and screenshots) and guide the MLLM through a step-by-step reasoning pathway from initial content analysis to final maturity rating determination. As a result, through explicitly incorporating CoT reasoning, our framework enables ChatGPT to better understand and apply maturity policies to facilitate maturity rating. Experimental results indicate that the proposed method outperforms all baseline models and other fusion strategies.


Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning

arXiv.org Artificial Intelligence

Mobile User Interface Summarization generates succinct language descriptions of mobile screens for conveying important contents and functionalities of the screen, which can be useful for many language-based application scenarios. We present Screen2Words, a novel screen summarization approach that automatically encapsulates essential information of a UI screen into a coherent language phrase. Summarizing mobile screens requires a holistic understanding of the multi-modal data of mobile UIs, including text, image, structures as well as UI semantics, motivating our multi-modal learning approach. We collected and analyzed a large-scale screen summarization dataset annotated by human workers. Our dataset contains more than 112k language summarizations across ~22k unique UI screens. We then experimented with a set of deep models with different configurations. Our evaluation of these models with both automatic accuracy metrics and human rating shows that our approach can generate high-quality summaries for mobile screens. We demonstrate potential use cases of Screen2Words and open-source our dataset and model to lay the foundations for further bridging language and user interfaces.


Android Security using NLP Techniques: A Review

arXiv.org Artificial Intelligence

Android is among the platforms most targeted by attackers. While attackers are improving their techniques, traditional solutions based on static and dynamic analysis have also been evolving. In addition to the application code, Android applications have some metadata that could be useful for security analysis of applications. Unlike traditional application distribution mechanisms, Android applications are distributed centrally in mobile markets. Therefore, besides application packages, such markets contain app information provided by app developers and app users. The availability of such useful textual data, together with the advancement in Natural Language Processing (NLP) that is used to process and understand textual data, has encouraged researchers to investigate the use of NLP techniques in Android security. In particular, security solutions based on NLP have accelerated in the last 5 years and proven to be useful. This study reviews these proposals and aims to explore possible research directions for future studies by presenting the state of the art in this domain. We mainly focus on NLP-based solutions under four categories: description-to-behaviour fidelity, description generation, privacy and malware detection.


Evaluating Usage of Images for App Classification

arXiv.org Machine Learning

App classification is useful in a number of applications such as adding apps to an app store or building a user model based on the installed apps. Presently there are a number of existing methods to classify apps based on a given taxonomy on the basis of their text metadata. However, text-based methods for app classification may not work in all cases, such as when the text descriptions are in a different language, or missing, or inadequate to classify the app. One solution in such cases is to utilize the app images to supplement the text description. In this paper, we evaluate a number of approaches in which app images can be used to classify the apps. In one approach, we use optical character recognition (OCR) to extract text from images, which is then used to supplement the text description of the app. In another, we use pic2vec to convert the app images into vectors, then train an SVM to classify the vectors to the correct app label. In another, we use the captionbot.ai tool to generate natural language descriptions from the app images. Finally, we use a method to detect and label objects in the app images and use a voting technique to determine the category of the app based on all the images. We compare the performance of our image-based techniques to classify a number of apps in our dataset. We use a text-based SVM app classifier as our baseline and obtain an improved classification accuracy of 96% for some classes when app images are added.
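The voting step described above, where a category is chosen for an app from labels predicted on each of its images, might look like the following minimal sketch. The abstract does not specify the voting rule, so plain majority voting is assumed here, with an alphabetical tie-break that is purely an illustrative choice; the function name and the example labels are hypothetical.

```python
from collections import Counter


def vote_category(per_image_labels):
    """Pick an app category by majority vote over per-image predicted labels.

    Returns None when there are no labels. Ties are broken alphabetically,
    which is an arbitrary assumption made for determinism.
    """
    counts = Counter(per_image_labels)
    if not counts:
        return None
    top = max(counts.values())
    winners = sorted(label for label, n in counts.items() if n == top)
    return winners[0]


# Hypothetical labels predicted for four screenshots of one app.
labels = ["game", "finance", "game", "game"]
category = vote_category(labels)
print(category)
```

A weighted vote (for example, by detector confidence) would be a natural variation, but the abstract gives no detail to support one choice over the other.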


Here's how Google Play is using AI to improve search - Memeburn

#artificialintelligence

Trying to find an app on the Google Play Store can be an exercise in frustration, especially if it's an eagerly anticipated app or new release. Things don't get much better for mega-popular apps, as the search results are often cluttered with irrelevant results. Fortunately, Google is working on a solution, using machine learning to get better results. "Searches by topic require more than simply indexing apps by query terms; they require an understanding of the topics associated with an app," the team of software engineers wrote. The work required machine-learning approaches, but one big challenge for machine learning was the size of the data-set to work with.


A Short Introduction to Using Word2Vec for Text Classification

#artificialintelligence

Machine learning applications on natural language are an extremely important tool in the data scientist's toolbox. Use cases can include auto-detecting the language of a website, detecting spam in your spam filter, or auto-completing search queries. When you're working with text data, an important use case is text classification, where the data scientist is tasked with creating an algorithm that can figure out what a bit of text is all about (what is the tagline) based on what is written in the document. This can be used in a myriad of examples we see every day, tagging things such as blog articles, app descriptions, and reviews. In many cases traditional text classification can be difficult to scale, because as the number of categories in the taxonomy increases, the amount of training required increases as well.
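One common way to use word vectors for text classification is to average the vectors of a document's words and compare the result with a reference vector per category. The sketch below shows that idea only; it is not taken from the article. The two-dimensional embeddings and category centroids are hand-written toy values standing in for vectors that a trained Word2Vec model would supply, and all names are hypothetical.

```python
import math


def doc_vector(tokens, embeddings, dim):
    """Average the vectors of known tokens; unknown tokens are skipped."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def classify(tokens, embeddings, centroids, dim):
    """Return the category whose centroid is most similar to the document."""
    v = doc_vector(tokens, embeddings, dim)
    return max(sorted(centroids), key=lambda label: cosine(v, centroids[label]))


# Toy stand-ins for trained word vectors and per-category centroids.
embeddings = {
    "play": [1.0, 0.0], "game": [0.9, 0.1],
    "bank": [0.0, 1.0], "pay": [0.1, 0.9],
}
centroids = {"games": [1.0, 0.0], "finance": [0.0, 1.0]}

label_a = classify(["play", "game", "now"], embeddings, centroids, 2)
label_b = classify(["bank", "pay"], embeddings, centroids, 2)
print(label_a, label_b)
```

Because new categories only need a centroid and not a retrained classifier, this style of approach is often suggested when the taxonomy grows, though its accuracy depends heavily on the quality of the underlying embeddings.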